ABSTRACT

Model selection is the problem of distinguishing competing models, perhaps featuring differ- 
ent numbers of parameters. The statistics literature contains two distinct sets of tools, those 
based on information theory such as the Akaike Information Criterion (AIC), and those on 
Bayesian inference such as the Bayesian evidence and Bayesian Information Criterion (BIC). 
The Deviance Information Criterion combines ideas from both heritages; it is readily com- 
puted from Monte Carlo posterior samples and, unlike the AIC and BIC, allows for parameter 
degeneracy. I describe the properties of the information criteria, and as an example compute 
them from WMAP3 data for several cosmological models. I find that at present the
information theory and Bayesian approaches give significantly different conclusions from that data.

1 INTRODUCTION

Although it has been widely recognized only recently, model selec- 
tion problems are ubiquitous in astrophysics and cosmology. While 
parameter estimation seeks to determine the values of a parame- 
ter set chosen by hand, model selection seeks to distinguish be- 
tween competing choices of parameter set. A considerable body 
of statistics literature is devoted to model selection [excellent text- 
book accounts are given by Jeffreys 1961, Burnham & Anderson 
2002, MacKay 2003, and Gregory 2005] and its use is widespread 
throughout many branches of science. For a non-technical overview 
of model selection as applied to cosmology, see Liddle, Mukherjee 
& Parkinson (2006a), and for an overview of techniques and appli- 
cations see Lasenby & Hobson (2006). 

In general, a model is a choice of parameters to be varied and 
a prior probability distribution on those parameters. The goal of 
model selection is to balance the quality of fit to observational data 
against the complexity, or predictiveness, of the model achieving 
that fit. This balance is achieved through model selection statistics,
which attach a number to each model enabling a rank-ordered list 
to be drawn up. Typically, the best model is adopted and used for 
further inference such as permitted parameter ranges, though the 
statistics literature has also seen increasing interest in multi-model 
inference combining a number of adequate models (e.g. Hoeting et 
al. 1999; Burnham & Anderson 2004). 

There are two main schools of thought in model selection. 
Bayesian inference, particularly as developed by Jeffreys culmi- 
nating in his classic textbook (Jeffreys 1961) and by many oth- 
ers since, can assign probabilities to models as well as to param- 
eter values, and manipulate these probabilities using rules such 
as Bayes' theorem. Information-theoretic methods, pioneered by 
Akaike (1974) with his Akaike Information Criterion, instead focus 
on the Kullback-Leibler information entropy (Kullback & Leibler
1951) as a measure of information lost when a particular model is
used in place of the (unknown) true model. Variants on this latter
theme include the Takeuchi Information Criterion (TIC, Takeuchi
1976), which extends the AIC by dropping the assumption that
the model set considered includes the true model. Bayesian statis- 
tics include the Bayesian evidence and an approximation to it 
known as the Bayesian Information Criterion (BIC, Schwarz 1978), 
which, despite the name, does not have an information-theoretic 
justification. 

Given the plethora of possible statistics, one might despair as 
to which to use, especially if they give conflicting results. Cosmolo- 
gists, in particular, tend to ally themselves with a Bayesian method- 
ology, for example the use of Markov Chain Monte Carlo (MCMC) 
methods to carry out parameter likelihood analyses, and are there- 
fore tempted to adopt methods advertised as such. However, even 
if one were to side automatically against frequentist approaches, 
the situation does not appear that clear cut; Burnham & Anderson 
(2004) have argued that the AIC can be derived in a Bayesian way 
(and the BIC in a frequentist one), and that one should not casually 
dismiss a criterion soundly grounded in information theory. 

Nevertheless, in my view the Bayesian evidence is the pre- 
ferred tool; in Bayesian inference it is precisely the quantity which 
updates the prior model probability to the posterior model proba- 
bility, and has an unambiguous interpretation in these probabilistic 
terms. The problem with the evidence is the difficulty in calculating 
it to the required accuracy, though the situation there has improved 
with the development of the nested sampling algorithm (Skilling 
2006) and its implementation for cosmology in the CosmoNest 
code (Mukherjee, Parkinson & Liddle 2006; Parkinson, Mukherjee 
& Liddle 2006). This paper is principally directed at circumstances 
where the evidence is not readily calculable, and a simpler model 
selection technique is required. 

In this article I describe and apply an additional information
criterion, the Deviance Information Criterion (DIC) of Spiegelhalter
et al. (2002, henceforth SBCL02), which combines heritage
from both Bayesian methods and information theory. It has interesting
properties. Firstly, unlike the AIC and BIC it accounts for the
situation, common in astrophysics, where one or more parameters
or combinations of parameters are poorly constrained by the data.
Secondly, it is readily calculable from posterior samples, such as 
those generated by MCMC methods. It has already been used in 
astrophysics to study quasar clustering (Porciani & Norberg 2006). 



2 MODEL SELECTION STATISTICS 
2.1 Bayesian evidence 

The Bayesian evidence, also known as the model likelihood and 
sometimes, less accurately, as the marginal likelihood, comes from 
a full implementation of Bayesian inference at the model level, and 
is the probability of the data given the model. Using Bayes' theorem,
it updates the prior model probability to the posterior model prob- 
ability. Usually the prior model probabilities are taken as equal, 
but quoted results can readily be rescaled to allow for unequal 
ones if required (e.g. Lasenby & Hobson 2006). In many circum- 
stances the evidence can be calculated without simplifying assump- 
tions (though perhaps with numerical errors). It has now been quite 
widely applied in cosmology; see for example Jaffe (1996), Hob- 
son, Bridle & Lahav (2002), Saini, Weller & Bridle (2004), Trotta 
(2005), Parkinson et al. (2006), and Lasenby & Hobson (2006). 
The evidence is given by 



$$E \equiv \int \mathcal{L}(\theta)\, P(\theta)\, d\theta , \tag{1}$$

where $\theta$ is the vector of parameters being varied in the model and
$P(\theta)$ is the properly-normalized prior distribution of those parameters
(often chosen to be flat). It is the average value of the likelihood
$\mathcal{L}$ over the entire model parameter space that was allowed before
the data came in. It rewards a combination of data fit and model 
predictiveness. Models which fit the data well and make narrow 
predictions are likely to fit well over much of their available pa- 
rameter space, giving a high average. Models which fit well for 
particular parameter values, but were not very predictive, will fit 
poorly in most of their parameter space driving the average down. 
Models which cannot fit the data well will do poorly in any event. 
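
The averaging in equation (1) can be illustrated with a minimal numerical sketch. It is a toy case, not a calculation from this paper: a one-dimensional gaussian likelihood with unit peak value, a flat prior centred on the peak, and a simple trapezoidal quadrature. The function name and its arguments are illustrative assumptions.

```python
import numpy as np

def evidence_flat_prior(sigma, prior_width, n_grid=200001):
    """Toy evidence: average of a unit-peak gaussian likelihood over a flat prior.

    Assumes (for illustration only) a 1D likelihood L(x) = exp(-x^2 / 2 sigma^2)
    and a flat prior of total width `prior_width` centred on the peak.
    """
    half = 0.5 * prior_width
    x = np.linspace(-half, half, n_grid)
    like = np.exp(-0.5 * (x / sigma) ** 2)          # likelihood, L_max = 1
    prior = np.full_like(x, 1.0 / prior_width)      # properly normalized flat prior
    integrand = like * prior
    dx = x[1] - x[0]
    # Trapezoidal rule on a uniform grid.
    return dx * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
```

In this toy case a prior much narrower than the likelihood returns an evidence close to the peak likelihood, while doubling an already wide prior roughly halves the evidence. That is the predictiveness penalty described above.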

The integral in equation (1) may however be difficult to calculate,
as it may have too many dimensions to be amenable to
evaluation by gridding, and the simplest MCMC methods such as 
Metropolis-Hastings produce samples only in the part of parameter 
space where the posterior probability is high rather than through- 
out the prior. Nevertheless, many methods exist (e.g. Gregory 2005; 
Trotta 2005), and the nested sampling algorithm (Skilling 2006) 
has proven feasible for many cosmology applications (Mukherjee 
et al. 2006; Parkinson et al. 2006; Liddle et al. 2006b). 

A particular property of the evidence worth noting is that it 
does not penalize parameters (or, more generally, degenerate pa- 
rameter combinations) which are unconstrained by the data. If the 
likelihood is flat or nearly flat in a particular direction, it simply 
factorizes out of the evidence integral leaving it unchanged. This is 
an appealing property, as it indicates that the model fitting the data 
is doing so really by varying fewer parameters than at first seemed 
to be the case, and it is the unnecessary parameters that should be 
discarded, not the entire model. 
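
As a short worked illustration of this factorization, suppose (an assumption made here for simplicity) that the likelihood depends only on $\theta_1$ and that the prior is separable, $P(\theta_1, \theta_2) = P(\theta_1) P(\theta_2)$, with each factor normalized. Then

$$E = \int\!\!\int \mathcal{L}(\theta_1)\, P(\theta_1)\, P(\theta_2)\, d\theta_1\, d\theta_2 = \left[ \int \mathcal{L}(\theta_1)\, P(\theta_1)\, d\theta_1 \right] \left[ \int P(\theta_2)\, d\theta_2 \right] = \int \mathcal{L}(\theta_1)\, P(\theta_1)\, d\theta_1 ,$$

so the unconstrained parameter $\theta_2$ leaves the evidence unchanged, whatever its prior width.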



2.2 AIC and BIC 

Much of the literature, both in astrophysics and elsewhere, seeks a 
simpler surrogate for the evidence which still encodes the tension 
between fit and model complexity. In Liddle (2004), I described 
two such statistics, the AIC and BIC, which have subsequently been 
quite widely applied to astrophysics problems. They are relatively 
simple to apply because they require only the maximum likelihood 
achievable within a given model, rather than the likelihood through- 
out the parameter space. Of course, such simplification comes at a 
cost, the cost being that they are derived using various assumptions, 
particularly gaussianity or near-gaussianity of the posterior distri- 
bution, that may be poorly respected in real-world situations. 
The AIC is defined as 



$$\mathrm{AIC} \equiv -2\ln\mathcal{L}_{\max} + 2k, \tag{2}$$

where $\mathcal{L}_{\max}$ is the maximum likelihood achievable by the model
and k the number of parameters of the model (Akaike 1974). The 
best model is the one which minimizes the AIC, and there is no 
requirement for the models to be nested. The AIC is derived by 
an approximate minimization of the Kullback-Leibler information 
entropy, which measures the difference between the true data dis- 
tribution and the model distribution. An explanation geared to as- 
tronomers can be found in Takeuchi (2000), while the full statistical 
justification is given by Burnham & Anderson (2002). 

The BIC was introduced by Schwarz (1978), and is defined as 



$$\mathrm{BIC} \equiv -2\ln\mathcal{L}_{\max} + k\ln N, \tag{3}$$

where N is the number of datapoints used in the fit. It comes from 
approximating the evidence ratios of models, known as the Bayes 
factor (Jeffreys 1961; Kass & Raftery 1995). The BIC assumes that 
the datapoints are independent and identically distributed, which 
may or may not be valid depending on the dataset under considera- 
tion (e.g. it is unlikely to be good for cosmic microwave anisotropy 
data, but may well be for supernova luminosity-distance data). 

Applications of these two criteria have usually shown broad 
agreement in the conclusions reached, but occasional differences 
in the detailed ranking of models. One should consider the extent 
to which the conditions used in the derivation of the criteria are vio- 
lated in real situations. A particular case in point is the existence of 
parameter degeneracies; inclusion (inadvertent or otherwise) of un- 
constrained parameters is penalized by the AIC and BIC, but not by 
the evidence. Interpretation of the BIC as an estimator of evidence 
differences is therefore suspect in such cases. 

Burnham & Anderson (2002, 2004) have stressed the impor- 
tance of using a version of the AIC corrected for small sample sizes, 
AIC$_c$. This is given by (Sugiura 1978)

$$\mathrm{AIC}_c = \mathrm{AIC} + \frac{2k(k+1)}{N-k-1}. \tag{4}$$

Since the correction term anyway disappears for large sample sizes,
$N \gg k$, there is no reason not to use it even in that case, i.e. it is
always preferable to use AIC$_c$ rather than the original AIC. In typical
small-sample cases, e.g. $N/k$ being only a few, the correction
term strengthens the penalty, bringing the AIC$_c$ towards the BIC
and potentially mitigating the difference between them.
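
The three criteria of equations (2) to (4) are simple enough to write down directly. The following is a minimal sketch with illustrative function names. Each function takes $-2\ln\mathcal{L}_{\max}$, the number of parameters $k$ and, where needed, the number of datapoints $N$.

```python
import math

def aic(neg2_ln_lmax, k):
    """Akaike Information Criterion, equation (2)."""
    return neg2_ln_lmax + 2 * k

def aicc(neg2_ln_lmax, k, n):
    """Small-sample corrected AIC, equation (4); requires n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(neg2_ln_lmax, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(neg2_ln_lmax, k, n):
    """Bayesian Information Criterion, equation (3)."""
    return neg2_ln_lmax + k * math.log(n)
```

For example, with $k = 6$ and $N = 1448$ the correction term is $84/1441 \approx 0.06$, which is negligible, whereas for a hypothetical small sample with $k = 5$ and $N = 20$ it is $60/14 \approx 4.3$.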



2.3 DIC 

The DIC was introduced by SBCL02. It has already been widely 
applied outside of astrophysics. Its starting point is a definition of 
an effective number of parameters $p_D$ of a model. This quantity,
known also as the Bayesian complexity, has already been introduced
into astrophysics by Kunz, Trotta & Parkinson (2006), with
focus on assessing the number of parameters that can be usefully
constrained by a particular dataset.
It is defined by

$$p_D = \overline{D(\theta)} - D(\bar\theta), \quad \mathrm{where} \quad D(\theta) = -2\ln\mathcal{L}(\theta) + C. \tag{5}$$

Here $C$ is a 'standardizing' constant depending only on the data
which will vanish from any derived quantity, and $D$ is the deviance
of the likelihood. The bars indicate averages over the posterior
distribution. In words, then, $p_D$ is the mean of the deviance, minus the
deviance of the mean. If we define an effective chi-squared as usual
by $\chi^2 = -2\ln\mathcal{L}$, we can write

$$p_D = \overline{\chi^2(\theta)} - \chi^2(\bar\theta). \tag{6}$$

Its intent becomes clear from studying a simple one-dimensional
example, in which the likelihood is a gaussian of zero
mean and width $\sigma$, i.e. $\ln\mathcal{L} = A - x^2/2\sigma^2$, and where the prior
distribution is flat with width $a\sigma$. Care is needed to properly
normalize the posterior, which relates the likelihood amplitude $A$ to
the prior width. In the limit where $a \gg 1$, so that the posterior
is well confined within the prior, one finds $p_D = 1$ (in this case,
the averaging is just evaluating the variance of the distribution, but
in units of that variance). This corresponds to a well-measured
parameter. If instead $a \ll 1$, so that the data are unable to constrain
the parameter, then $p_D \to 0$ since $\chi^2$ becomes independent of $x$.
Hence $p_D$ indicates the number of parameters actually constrained
by the data. Extension of the above argument to an $N$-dimensional
gaussian, potentially with covariance, indicates $p_D = N$ if all
dimensions are well contained within the prior, and $p_D < N$ otherwise
(SBCL02; Kunz et al. 2006).
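
The two limits of this one-dimensional example can be checked numerically. The sketch below is illustrative only: it evaluates the posterior on a uniform grid across the prior range, assumes the prior is centred on the likelihood peak, and uses function and argument names chosen here. The amplitude $A$ cancels in $p_D$ and is therefore omitted.

```python
import numpy as np

def bayesian_complexity_1d(a, sigma=1.0, n_grid=200001):
    """p_D for a zero-mean gaussian likelihood of width sigma and a flat prior
    of width a * sigma centred on zero (toy example, grid-based)."""
    half = 0.5 * a * sigma
    x = np.linspace(-half, half, n_grid)
    chi2 = (x / sigma) ** 2              # -2 ln L up to an additive constant
    post = np.exp(-0.5 * chi2)           # flat prior: posterior is proportional to L
    post /= post.sum()                   # normalize on the uniform grid
    mean_x = np.sum(post * x)
    mean_chi2 = np.sum(post * chi2)      # mean of the deviance
    chi2_at_mean = (mean_x / sigma) ** 2 # deviance at the mean
    return mean_chi2 - chi2_at_mean
```

With $a = 20$ the result is very close to 1, while with $a = 0.1$ it is close to $a^2/12 \approx 8 \times 10^{-4}$, the value expected when the posterior is nearly uniform across the prior range.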

One issue of debate in the statistics literature is the choice of 
the mean parameter value in the definition of $p_D$. One could
alternatively argue for the maximum likelihood in its place. This choice
affects the possible reparametrization dependence of the statistic 
(SBCL02; Celeux et al. 2006). It may be that the best choice de- 
pends on the situation under study (e.g. the mean parameter value 
will be a poor choice if the likelihood has distinct strong peaks). 

The DIC is then defined as 

$$\mathrm{DIC} = D(\bar\theta) + 2p_D = \overline{D(\theta)} + p_D. \tag{7}$$

The first expression is motivated by the form of the AIC, replac- 
ing the maximum likelihood with the mean parameter likelihood, 
and the number of parameters with the effective number. It can 
therefore be justified on information/decision theory grounds, as 
discussed by SBCL02. The second form is interesting because the 
mean deviance can be justified in Bayesian terms, which always 
deal with model-averaged quantities rather than maximum values. 
The DIC has two attractive properties: 

(i) It is determined by quantities readily obtained from Monte 
Carlo posterior samples. One simply averages the deviances over 
the samples. If the calculation is being done by whoever generated 
the chains, they can obtain the deviance at the mean with a single 
extra likelihood call, but even if using chains generated by others, 
it should be fine to use the sample closest to that mean value as the 
estimator, especially bearing in mind the possibility that the mode 
could have been used in place of the mean. The calculation is also 
easily done with posterior samples generated by nested sampling, 
which have non-integer weights (Parkinson et al. 2006). 

(ii) By using the effective number of parameters, the DIC over- 
comes the problem of the AIC and BIC that they do not discount 
parameters which are unconstrained by the data. 
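
Property (i) can be sketched in a few lines. The following is a minimal illustration rather than the code used for the results in this paper. It assumes the posterior samples are supplied as an array of shape (number of samples, number of parameters), with the log-likelihood of each sample and optional (possibly non-integer) weights. It estimates the deviance at the mean from the sample closest to the mean, with distances measured in units of each parameter's standard deviation. All names are illustrative, and the standardizing constant $C$ is set to zero.

```python
import numpy as np

def dic_from_samples(samples, lnlike, weights=None):
    """Estimate p_D and the DIC from posterior samples (illustrative sketch).

    samples : array of shape (n_samples, n_params)
    lnlike  : array of shape (n_samples,), ln L at each sample
    weights : optional sample weights (e.g. from nested sampling)
    """
    samples = np.asarray(samples, dtype=float)
    lnlike = np.asarray(lnlike, dtype=float)
    if weights is None:
        weights = np.ones(len(lnlike))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    deviance = -2.0 * lnlike                      # D = -2 ln L, with C = 0
    mean_deviance = np.sum(w * deviance)          # posterior mean of D

    mean = w @ samples                            # posterior mean parameters
    std = np.sqrt(w @ (samples - mean) ** 2)      # per-parameter standard deviation
    dist2 = np.sum(((samples - mean) / std) ** 2, axis=1)
    deviance_at_mean = deviance[np.argmin(dist2)] # nearest sample stands in for the mean

    p_d = mean_deviance - deviance_at_mean
    return {
        "mean_deviance": mean_deviance,
        "deviance_at_mean": deviance_at_mean,
        "p_D": p_d,
        "DIC": deviance_at_mean + 2.0 * p_d,
    }
```

If the likelihood code is available, the nearest-sample estimate can be replaced by a single likelihood evaluation at the mean parameter values.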



Note that in the case of well-constrained parameters, the DIC
approaches the AIC and not the BIC, since $D(\bar\theta) \to -2\ln\mathcal{L}_{\max}$
and $p_D \to k$. It is plausible to believe that it too can be corrected
for small dataset sizes using the same formula that leads to AIC$_c$,
though to my knowledge there is currently no proof of this.



2.4 Other criteria 

In addition to those already mentioned, the literature contains many 
other information criteria, but mostly sharing the heritage of those 
above. The TIC (Takeuchi 1976) generalizes the AIC by dropping 
the assumption that the true model is in the set considered, but in 
practice is hard to compute and, where computation has been car- 
ried out, tends to give results very similar to the AIC (Burnham 
& Anderson 2002, 2004). A Bayesian version of the AIC, the Ex- 
pected AIC (EAIC), where one takes its expected value over the 
posterior distribution rather than evaluating at the maximum, has 
been proposed (by Brooks in the comments to SBCL02) but does 
not appear to have been significantly applied. 

Other information criteria, which appear to have been less 
widely used, include the Network Information Criterion (NIC), the 
Subspace Information Criterion (SIC, though this abbreviation is 
sometimes used for Schwarz Information Criterion as another name 
for the BIC), and the Generalized Information Criterion (GIC). The 
DIC also comes in many variants, see e.g. Celeux et al. (2006). 

An interesting variant was proposed by Sorkin (1983), using 
a Turing machine construction to define an entropy associated with 
the theory to be used as a penalty term. This was recently applied 
to cosmological data by Magueijo & Sorkin (2006). It has not been 
picked up by the statistics community, but may be related to the 
widely-used minimum message length paradigm (Wallace & Boulton
1968; Wallace 2005). The idea of interpreting the best model as
the one offering maximal algorithmic compression of the data goes 
all the way back to late 17th century writings by Leibniz. 



2.5 Dimensional consistency and model selection philosophy 

Dimensional consistency refers to the behaviour of the model se- 
lection statistics in the limit of arbitrarily large datasets. The BIC 
and evidence are dimensionally consistent, meaning that if one of 
the considered models is true, they give 100 per cent support to that 
model as the dataset becomes large. As a necessary consequence, 
however, they will give 100 per cent support to the best model even 
if it is not true. By contrast, the AIC is dimensionally inconsistent 
(Kashyap 1980), sharing its support around the models even with 
infinite data. As the DIC approaches the AIC in the limit of large 
datasets, it too is dimensionally inconsistent (SBCL02). 

Dimensional consistency does not seem to particularly bother 
most statisticians, as they are typically seeking models which can 
explain data and have some predictive power, rather than expect- 
ing to represent some underlying truth. Indeed, they commonly 
quote statistician George Box: "All models are wrong, but some 
are useful." The problem of dimensional consistency is therefore 
mitigated, because they do not expect the set of models to remain 
static as the dataset evolves. Cosmologists, however, are probably 
not yet willing to concede that they might be looking for something 
other than absolute truth specified by a finite number of parame- 
ters. Combining this line of argument with statements above, this 
implies that the Bayesian evidence indeed is the preferred choice 
for cosmological model selection when it can be calculated.

Table 1. Results for comparison of different models to WMAP3 data. The differences are quoted with respect to the first model. Negative is preferred.

Model                            Parameters $k$  $p_D$  $-2\ln\mathcal{L}(\bar\theta)$  DIC      $-2\ln\mathcal{L}_{\max}$  $\Delta$DIC  $\Delta$AIC$_c$  $\Delta$BIC
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Base+$A_{\rm SZ}$                6               5.2    11262.6                         11272.9  11262.2                    0            0                0
Base+$n_s$                       6               6.3    11253.3                         11265.9  11252.5                    -7.0         -9.7             -9.7
Base+$A_{\rm SZ}$+$n_s$          7               5.6    11253.0                         11264.1  11252.6                    -8.8         -7.6             -2.3
Base+$A_{\rm SZ}$+$n_s$+$r$      8               5.4    11254.2                         11265.0  11252.6                    -7.9         -5.6             +5.0
Base+$A_{\rm SZ}$+$n_s$+running  8               6.2    11250.0                         11262.3  11249.0                    -10.6        -9.2             +1.4


3 INFORMATION CRITERIA FOR WMAP3 

I now apply the information criteria to WMAP3 model fits as compiled
by the WMAP team on LAMBDA.$^1$ The DIC calculation is
straightforward. The 8 chains for each cosmology are concatenated,
the mean deviance found by averaging $-2\ln\mathcal{L}$ over the samples, and the
deviance at the mean estimated by finding the MCMC point located
closest to the mean (where the distance in each parameter direction
was measured in units of the standard deviation of that parameter).

I also quote the values of the differences in AIC$_c$ and BIC,
where the maximum likelihood is taken directly from the most
likely posterior sample (in principle this may slightly disadvantage
models with more parameters, for which the most likely sample
will typically be slightly further from the true maximum, though
for the WMAP3 sample sizes this effect will be small). I take $N$ to
be the number of power spectrum datapoints, $N_{\rm WMAP3} = 1448$
(Spergel et al. 2006), this choice to be discussed further below
(nothing changes significantly if a slightly larger number $\sim 3000$ is
used to allow for the pixel-based treatment of the low-$\ell$ likelihood).
With this large value, $\Delta$AIC and $\Delta$AIC$_c$ are indistinguishable.
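
As a consistency check, the $\Delta$AIC$_c$ and $\Delta$BIC columns of Table 1 can be recomputed from the tabulated $k$ and $-2\ln\mathcal{L}_{\max}$ values together with $N = 1448$. The sketch below is illustrative: the variable and function names are assumptions, and the model labels are plain-text stand-ins for those in the table.

```python
import math

N_DATA = 1448  # number of power spectrum datapoints quoted in the text

# (model label, k, -2 ln L_max), copied from Table 1
TABLE = [
    ("Base+A_SZ", 6, 11262.2),
    ("Base+n_s", 6, 11252.5),
    ("Base+A_SZ+n_s", 7, 11252.6),
    ("Base+A_SZ+n_s+r", 8, 11252.6),
    ("Base+A_SZ+n_s+running", 8, 11249.0),
]

def aicc(neg2_ln_lmax, k, n):
    """AIC with the small-sample correction of equation (4)."""
    return neg2_ln_lmax + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def bic(neg2_ln_lmax, k, n):
    """BIC of equation (3)."""
    return neg2_ln_lmax + k * math.log(n)

def differences(table, n):
    """Differences relative to the first model, rounded to one decimal place."""
    _, k0, d0 = table[0]
    ref_aicc, ref_bic = aicc(d0, k0, n), bic(d0, k0, n)
    return {
        name: (round(aicc(d, k, n) - ref_aicc, 1), round(bic(d, k, n) - ref_bic, 1))
        for name, k, d in table
    }
```

Calling `differences(TABLE, N_DATA)` returns pairs that match the tabulated $\Delta$AIC$_c$ and $\Delta$BIC entries to the quoted precision.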

The available model fits unfortunately do not quite cover all 
cases that might be of interest. All well-fitting models vary five 
standard parameters, being the physical baryon density $\Omega_b h^2$, the
physical cold dark matter density $\Omega_c h^2$, the sound horizon $\theta$, the
perturbation amplitude $\ln(10^{10} A_s)$, and the optical depth $\tau$ (the
Hubble constant and dark energy density are derived parameters).
However, fits varying just these parameters, a Harrison-Zel'dovich
model suggested as the best model from first-year WMAP data in
Liddle (2004), are not available. Nevertheless, I will refer to this
as the Base model. Instead, there are two different six-parameter
models, one adding the spectral index $n_s$ and one adding the
phenomenological Sunyaev-Zel'dovich (SZ) marginalization parameter
$A_{\rm SZ}$ (Spergel et al. 2006). All further available models include
$A_{\rm SZ}$; extra parameters that I then consider are the spectral index
$n_s$ (giving the standard $\Lambda$CDM model), further addition of tensors
$r$ to give the standard slow-roll inflation model, and inclusion of
spectral index running (without tensors).

The main subtlety is the inclusion of $A_{\rm SZ}$. This is poorly
constrained by the data and hence is not expected to contribute fully to
$p_D$; nevertheless the likelihood does have some dependence on it
and it must be included in the analysis that determines the deviance
at the mean. Of the parameters considered, $A_{\rm SZ}$ and $\tau$ are
phenomenological parameters which, at least in principle though not
yet in practice, can be determined from the others. The remaining 
four are truly independent according to present understanding. 

The uncertainty in the DIC may not be well estimated by
analyzing subsamples, as with smaller samples the deviance at the mean
will be less well estimated by the nearest point. Instead I estimated
the uncertainty by employing bootstrap resamples of the combined
sample list. This showed that the statistical accuracy was limited by
the accuracy with which the $\ln\mathcal{L}$ values were stored, $\pm 0.1$,
corresponding to $\pm 0.2$ in the DIC. As this is a much smaller uncertainty
than the level at which differences are significant, the statistical
uncertainty in the determination of the DIC is negligible.
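
A bootstrap estimate of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the numbers above: samples are treated as equally weighted and independent (MCMC chains are in general correlated, which a simple bootstrap ignores), the deviance at the mean is taken from the nearest sample, and all names are illustrative.

```python
import numpy as np

def dic(samples, lnlike):
    """DIC = D(mean) + 2 p_D, with D(mean) taken from the sample nearest the mean."""
    deviance = -2.0 * lnlike
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    nearest = np.argmin((((samples - mean) / std) ** 2).sum(axis=1))
    p_d = deviance.mean() - deviance[nearest]
    return deviance[nearest] + 2.0 * p_d

def bootstrap_dic(samples, lnlike, n_boot=200, seed=0):
    """Mean and standard deviation of the DIC over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(lnlike)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample with replacement
        draws[b] = dic(samples[idx], lnlike[idx])
    return draws.mean(), draws.std(ddof=1)
```

The second returned value is the bootstrap standard deviation, which serves as the estimate of the statistical uncertainty in the DIC.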

The results are shown in Table 1. The $p_D$ values are in good
agreement with expectation. Kunz et al. (2006) computed $p_D$ for
several models using a compilation of microwave anisotropy data
including WMAP3, and always found $p_D$ close to the input number
of parameters. However, they ran their own chains and did not
include the poorly-constrained parameters $A_{\rm SZ}$ and $r$. Models
including those parameters return a $p_D$ significantly less than $k$.

While only the Bayesian evidence has the full interpretation 
as the model likelihood, leading to the posterior model probability, 
the AIC has also been interpreted as a model likelihood by defining
Akaike weights (Akaike 1981; Burnham & Anderson 2004)

$$w_i = \frac{\exp(-\Delta\mathrm{AIC}_{c,i}/2)}{\sum_{r=1}^{R} \exp(-\Delta\mathrm{AIC}_{c,r}/2)}, \tag{8}$$

$^1$ Legacy Archive for Microwave Background Data Analysis: http://lambda.gsfc.nasa.gov
Chains were downloaded in December 2006. The subsequent January 2007 update does not allow
model selection as the chains were not all generated with the same likelihood code.



where there are R models and the differences are with respect to 
any one. The same interpretation can be given to the DIC differ- 
ences (SBCL02). For the BIC, insofar as it well approximates twice 
the log of the Bayes factor, it too can be interpreted as a model
likelihood. By convention significance is then judged on the Jeffreys'
scale, which rates $\Delta$IC $> 5$ as 'strong' and $\Delta$IC $> 10$ as
'decisive' evidence against the model with higher criterion value. If
the interpretation as model likelihoods holds, these points corre- 
spond to odds ratios of approximately 13:1 and 150:1 against the 
weaker model. As with the evidence, these likelihoods can be fur- 
ther weighted by a prior model probability if desired. 
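
The weights and odds ratios described here are straightforward to evaluate. The sketch below is illustrative, with function names chosen here. It takes a list of criterion differences, for example the $\Delta$AIC$_c$ column of Table 1, and returns normalized weights. It also converts a single difference into odds against the weaker model, assuming the model-likelihood interpretation holds.

```python
import numpy as np

def akaike_weights(delta_ic):
    """Normalized weights exp(-delta/2) / sum(exp(-delta/2)).

    The differences may be taken with respect to any one model. Shifting by the
    minimum only improves numerical stability and cancels in the ratio.
    """
    delta = np.asarray(delta_ic, dtype=float)
    rel = np.exp(-0.5 * (delta - delta.min()))
    return rel / rel.sum()

def odds_against(delta):
    """Odds against the model with the higher criterion value, exp(delta / 2)."""
    return float(np.exp(0.5 * delta))
```

For differences of 5 and 10 this gives odds of $e^{2.5} \approx 12$ and $e^{5} \approx 148$, in line with the approximate figures quoted above. Applied to the Table 1 values $(0, -9.7, -7.6, -5.6, -9.2)$, the first model receives a weight below one per cent.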

Recall that the DIC, like the AIC, is motivated from information
theory, while the BIC is not. Indeed, we see that the DIC results
quite closely follow the AIC results; both argue quite strongly
against the Base+$A_{\rm SZ}$ model, but are then rather inconclusive
amongst the remaining models. So information theory methods are
neither for nor against inclusion of extra parameters such as $r$ and
running at this stage. Incidentally, we can also see that if the DIC
were defined using $\mathcal{L}_{\max}$ rather than $\mathcal{L}(\bar\theta)$, little difference would
have arisen in this comparison.

The information criteria indicate that WMAP3 has put the 
Harrison-Zel'dovich model (with SZ marginalization) under con- 
siderable, if not yet conclusive, pressure. This is in accord with the 
conclusions reached by Spergel et al. (2006) using chi-squared per 
degree of freedom arguments, though the information criteria give
weaker support to this conclusion by recognizing model dimen- 
sionality. The strength of conclusion against Harrison-Zel'dovich 
could also be weakened by various systematic effects in data anal- 
ysis choices, e.g. inclusion of gravitational lensing (Lewis 2006), 
beam modelling (Peiris & Easther 2006), and point-source subtrac- 
tion (Eriksen et al. 2006; Huffenberger, Eriksen & Hansen 2006). 

By contrast, Bayesian approaches do not put $n_s = 1$ under
any kind of pressure. Parkinson et al. (2006) found that the full
evidences for the Base model and Base+$n_s$ were indistinguishable
with WMAP3 alone, and still inconclusive with inclusion of other
datasets. That analysis did not include SZ marginalization, however,
so the equivalent comparison cannot be made here. Nevertheless,
the BIC comparison between those models each with $A_{\rm SZ}$ added
does not show any strong preference, and it seems a safe bet that
had the Base model itself been supplied by WMAP3, its BIC difference
compared with Base+$n_s$, the best model in the set as judged
by the BIC, would not have been significant.

Further, while the information theory methods are ambivalent 
about r and running, the BIC argues rather strongly against, espe- 
cially in the case of tensors which offer no improvement at all in 
data-fitting. Full evidence calculations however show that this con- 
clusion is quite prior dependent (Parkinson et al. 2006). 

That the two methods give such different answers is due to 
the way that prior assumptions are treated, in particular the prior 
widths of the parameter ranges. The AIC does not care about this 
at all, and the DIC only cares while the data are weak enough that
some prior information on the parameter distribution remains. By 
contrast, in Bayesian model comparison the prior width is a key 
concept, determining the predictiveness of the model. For the evi- 
dence this is reflected in the domain of integration over which the 
likelihood is averaged, while for the BIC it is in the dependence on 
the amount of data. Cosmologists are in the fortunate position that 
for many parameters the likelihood is highly compressed within 
reasonable priors, forcing a discrepancy between information the- 
ory and Bayesian results. This discrepancy will be further enhanced 
in the future if the data continue to improve without requiring evo- 
lution in the model dataset, i.e. the problem of dimensional incon- 
sistency of the AIC/DIC may already be with us. 

Concerning the inclusion of $A_{\rm SZ}$ in models, it is clear that
Bayesian methods do not favour including it as a fit parameter, since
it is poorly constrained and does not significantly improve the fit.
However, the SZ effect is certainly predicted to be in the data at
some level, though it ought to be derived from the other parameters
rather than fit. It is tempting to try to deal with this by using $p_D$ in
the BIC rather than k, but there is no existing justification for doing 
so. The same issue does not arise with the optical depth, also a 
derived parameter, as it is well constrained by the data in all models. 

In computing the BIC above, I adopted the number of data- 
points literally. This may not always be the best choice: the deriva- 
tion of the BIC requires the data to be independent and identically 
distributed, and it may be that this can be better achieved by bin- 
ning the data in some suitable way. However to do so would require 
a whole new likelihood analysis for the binned data, counter to the 
desire here that the methods should be applicable to pre-existing 
posterior samples. In any case there does not appear to be any well- 
defined way to judge how much binning, if any, is desirable. 

Finally, I note that while here it is the BIC which appears to 
behave most like the evidence, in their quasar clustering studies 
Porciani & Norberg (2006) found that the DIC was the only cri- 
terion to give precisely the same model ranking order and level of 
inconclusiveness as the Bayes factors, with the BIC underfitting. 



4 SUMMARY 

I have described several information criteria that can be used for as- 
trophysical model selection, representing the rival strands of infor- 
mation theory and Bayesian inference. In application to WMAP3 
data, the DIC behaves rather similarly to the AIC, despite the
presence of parameter degeneracies. The conclusions one would draw
from those statistics are rather different from those indicated by
Bayesian methods, either the full evidence as computed in Parkinson
et al. (2006) or the BIC as calculated in this article.